Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: uneven load among clickhouse shards caused by retry error mechanism #357

Merged

Conversation

nir3c
Copy link
Contributor

@nir3c nir3c commented Aug 2, 2023

Description

fix uneven load among clickhouse shards caused by the retry error mechanism (related to issues #325 and #322 )

The issue happens when ChProxy sends the HTTP call to one of the shards with a retry mechanism.

When the response from the shard is 502 the scope's host swaps with another one to send by it, which causes not to decreasing the initial host's counter and the new host's counter to decrease instead (when we finally decrementing the scope), which could lead to a situation when the new host's counter is 0 to be set to MaxUint32 value (as 0 - 1 equals to the maximum value of uint32).
once this situation happens the new host will not be able to receive any new requests (as its counter value will be always greater than all other hosts' counter, and the balancing function logic for getting the new host is based on the host's counter + penalty) and the host's replica will never be used (same as for the host -> it will always be higher value than all other replicas as we can see in the getReplica function)

Pull request type

Please check the type of change your PR introduces:

  • Bugfix
  • Feature
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • Documentation content changes
  • Other (please describe):

Checklist

  • Linter passes correctly
  • Add tests which fail without the change (if possible)
  • All tests passing
  • Extended the README / documentation, if necessary

Does this introduce a breaking change?

  • Yes
  • No

Further comments

@sigua-cs sigua-cs self-assigned this Aug 3, 2023
@sigua-cs sigua-cs added the bug label Aug 3, 2023
@sigua-cs
Copy link
Collaborator

sigua-cs commented Aug 3, 2023

@nir3c thanks for the PR. Please check failed tests

proxy.go Outdated
Comment on lines 241 to 242
// comment s.host.dec() line to avoid double increment; issue #322
// s.host.dec()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sigua-cs should I remove this comment as I add the decrement in line 255?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes you should since your PR fix this legacy quick & dirty (and broken) fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@mga-chka
Copy link
Collaborator

mga-chka commented Aug 3, 2023

Thanks for the PR.
FYI, since your PR is linked to the tricky issues we had with the retry mechanism, we will take a bit of time to review it (since we will need multiple reviewer and because of the holidays it will take more time).

Copy link
Collaborator

@Blokje5 Blokje5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Great find.


currentHost := s.host

// decrement the current failed host counter and increment the new host
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might also a be a good idea to link the PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Collaborator

@mga-chka mga-chka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the PR, sorry for the long delay (we take a lot of days off in summer in France).
I left a few comments but they're very minor. Please tell me we want to fix them so that we don't wait for an extra 2 weeks before we merge the PR.

Once the PR is merged, we will release a new version ASAP so that you can benefit from your fix.

assert.Equal(t, 1, int(s.host.load()))

assert.Equal(t, 0, int(erroredHost.counter.load()))
assert.Equal(t, 5, int(erroredHost.penalty))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should use the penaltySize variable instead of 5, it would make the test more readable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

proxy.go Outdated
Comment on lines 241 to 242
// comment s.host.dec() line to avoid double increment; issue #322
// s.host.dec()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes you should since your PR fix this legacy quick & dirty (and broken) fix.

1. add magic number penaltySize for test assert the number of error host's load + penalty result
2. update comments
@nir3c nir3c requested a review from mga-chka September 1, 2023 10:05
@nir3c nir3c closed this Sep 1, 2023
@nir3c nir3c reopened this Sep 1, 2023
@nir3c
Copy link
Contributor Author

nir3c commented Sep 1, 2023

thanks for the PR, sorry for the long delay (we take a lot of days off in summer in France). I left a few comments but they're very minor. Please tell me we want to fix them so that we don't wait for an extra 2 weeks before we merge the PR.

Once the PR is merged, we will release a new version ASAP so that you can benefit from your fix.

thanks for reviewing the code, I update the code based on the comments you gave me, feel free review it again

@mga-chka
Copy link
Collaborator

mga-chka commented Sep 1, 2023

thanks for the PR, sorry for the long delay (we take a lot of days off in summer in France). I left a few comments but they're very minor. Please tell me we want to fix them so that we don't wait for an extra 2 weeks before we merge the PR.
Once the PR is merged, we will release a new version ASAP so that you can benefit from your fix.

thanks for reviewing the code, I update the code based on the comments you gave me, feel free review it again

We should release a new version containing your fix on Monday.

@mga-chka mga-chka merged commit 7bdfdb5 into ContentSquare:master Sep 1, 2023
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging this pull request may close these issues.

4 participants